Smoothing

Smoothing is a very powerful technique used all across data analysis. It is designed to estimate \(f(x)\) when the shape of this function is unknown but assumed to be smooth. The general idea is to group data points into strata that are expected to have similar expectations and to compute the average, or fit a simple model, within each stratum. We will use the 2008 presidential election polls as an example.

polls_2008
## Source: local data frame [543 x 14]
## 
##                          Pollster start_date   end_date     N
##                             (chr)     (time)     (time) (int)
## 1                  Marist College 2008-11-03 2008-11-03   804
## 2             GWU (Lake/Tarrance) 2008-11-02 2008-11-03   400
## 3  DailyKos.com (D)/Research 2000 2008-11-01 2008-11-03  1100
## 4                        IBD/TIPP 2008-11-01 2008-11-03   981
## 5                       Rasmussen 2008-11-01 2008-11-03  3000
## 6                             ARG 2008-11-01 2008-11-03  1200
## 7          Reuters/ C-SPAN/ Zogby 2008-10-31 2008-11-03  1226
## 8              Harris Interactive 2008-10-30 2008-11-03  3946
## 9                  Marist College 2008-11-02 2008-11-02   635
## 10                        NBC/WSJ 2008-11-01 2008-11-02    NA
## ..                            ...        ...        ...   ...
## Variables not shown: population_type (chr), McCain (dbl), Obama (dbl),
##   Barr (chr), Nader (chr), Other (chr), Undecided (chr), Margin (chr),
##   diff (dbl), day (dbl)

For each day starting June 1, 2008, we compute the average of the polls that started that day. We will denote this observed difference with \(Y\) and the days with \(X\). Below we create and plot this dataset and fit a regression line.

dat <-  filter(polls_2008, start_date>="2008-06-01") %>% 
  group_by(X=day)  %>% 
  summarize(Y=mean(diff))

dat %>% ggplot(aes(X, Y)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

Note that when we model \(f(x) = \mbox{E}(Y \mid X=x)\) with a line, we do not appear to describe the trend very well. Note, for example, that on September 4 (day -62) the Republican Convention was held. This gave McCain a boost in the polls, which can be clearly seen in the data. The regression line does not capture this.

To see this more clearly, note that points above the fitted line (green) and those below (purple) are not evenly distributed. We therefore need an alternative, more flexible approach.

resids <- ifelse(lm(Y~X, data=dat)$resid >0, "+", "-")
dat %>% mutate(resids=resids) %>% 
  ggplot(aes(X, Y)) + 
  geom_point(cex=5,pch=21) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(aes(X,Y,color=resids), cex=4)

We will explore ways of estimating \(f(x)\) that do not assume it is linear.

Bin Smoothing

Instead of fitting a line, let’s go back to the idea of stratifying and computing the mean. This is referred to as bin smoothing. The general idea is that the underlying curve does not vary wildly; this is what we refer to as smooth. If the curve is smooth enough, then within small bins it is approximately constant. If we assume the curve is constant within a bin, then all the \(Y\) in that bin have the same expected value. For example, in the plot below, if we use bins of a week, we highlight the points in a bin centered at day -125 as well as the points in a bin centered at day -55. We also show the fitted mean values for the \(Y\) in those bins with dashed lines (code not shown):

By computing this mean for bins around every point, we form an estimate of the underlying curve \(f(x)\). Below we show the procedure happening as we move from the smallest value of \(X\) to the largest.
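This procedure is simple enough to sketch by hand. Below is a minimal hand-rolled version (the function and variable names are ours, not from the text; `ksmooth`, used later, does the same job):

```r
# A minimal hand-rolled bin smoother: for each point x0, average the
# Y values whose X falls within half a bandwidth of x0 (a "box" kernel).
box_smooth <- function(x, y, bandwidth = 7) {
  sapply(x, function(x0) mean(y[abs(x - x0) <= bandwidth / 2]))
}

# sanity check: a flat series is left unchanged by smoothing
x <- -155:-1
y <- rep(3, length(x))
all(box_smooth(x, y) == 3)  # TRUE
```

With the polls data, `box_smooth(dat$X, dat$Y)` corresponds to what `ksmooth` with `kernel = "box"` computes at the observed points.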


bin_smoother1

The final result looks like this:

span <- 7 # bandwidth in days: one-week bins, matching the text above
mod <- ksmooth(dat$X, dat$Y, kernel="box", bandwidth = span)
bin_fit <- data.frame(X=dat$X, .fitted=mod$y)
ggplot(dat, aes(X, Y)) +
    geom_point(cex=5) + geom_line(aes(x=X, y=.fitted),
                             data=bin_fit, color="red")

Kernels

Note that the final result is quite wiggly. One reason for this is that each time the window moves, two points change. We can attenuate this somewhat by taking weighted averages that give the center point more weight and points farther away less weight.
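As a sketch of the idea (our own minimal implementation, using a normal density for the weights; the names and bandwidth choice are ours):

```r
# Weighted average at a point x0: weights decay with distance from x0
# via a normal (Gaussian) kernel, so points at the window's edge
# contribute very little.
kernel_smooth <- function(x, y, x0, bandwidth = 7) {
  w <- dnorm(x, mean = x0, sd = bandwidth / 4)
  sum(w * y) / sum(w)
}

x <- -10:0
y <- c(rep(0, 5), rep(1, 6))   # a step: 0s early, 1s late
kernel_smooth(x, y, x0 = -10)  # dominated by the early 0s
kernel_smooth(x, y, x0 = 0)    # dominated by the late 1s
```

Because the weights change gradually as the window slides, the resulting estimate is less wiggly than the box-kernel version.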

In this animation we see that points on the edge get less weight:


bin_smoother2

Note that the estimate is smoother now.

mod <- ksmooth(dat$X, dat$Y, kernel="normal", 
               bandwidth = span)
bin_fit2 <- data.frame(X=dat$X, .fitted=mod$y)

ggplot(dat, aes(X, Y)) +
    geom_point(cex=5) + geom_line(aes(x=X, y=.fitted), data=bin_fit2, color="red")

There are several functions in R that implement bin smoothers. One example is ksmooth shown above. However, in practice, we typically prefer methods that use slightly more complex models than fitting a constant. The final result above, for example, is still somewhat wiggly. Methods such as loess, which we explain next, improve on this.

Loess

Local weighted regression (loess) is similar to bin smoothing in principle. The main difference is that we approximate the local behavior with a line or a parabola. This permits us to expand the bin sizes, which stabilizes the estimates. Below we see lines fitted to two bins that are slightly larger than those we used for the bin smoother (code not shown). We can use larger bins because fitting lines provides slightly more flexibility.

As we did for the bin smoother, we show 12 steps of the process that leads to a loess fit (code not shown):

span <- 0.05

dat2 <- dat %>%
  inflate(center = unique(dat$X)) %>% # inflate() is from an older version of the broom package
  mutate(dist = abs(X - center)) %>%
  filter(rank(dist) / n() <= span) %>%
  mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)


dat2 %>% filter(center %in% c(-125, -55)) %>%
  ggplot(aes(X, Y)) +   
  geom_point(aes(alpha = weight)) +
  geom_smooth(aes(group = center, frame = center, weight = weight), 
              method = "lm", se = FALSE) +
  geom_vline(aes(xintercept = center, frame = center), lty = 2) +
  geom_point(shape = 1, data = dat) 

Note that now that we are fitting lines instead of constants, we can fit them to larger windows.

And then we fit a line locally at each point and keep the predicted value at that point:


loess

There are three other important differences between loess and the typical bin smoother. The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument, which expects a proportion. For example, if N is the number of data points and span=0.5, then for a given \(x\), loess will use the 0.5*N closest points to \(x\) for the fit. The second difference is that, when fitting the parametric model to obtain \(f(x)\), loess uses weighted least squares, with higher weights for points that are closer to \(x\). The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and down-weighted for the next iteration. To use this option, we use the argument family="symmetric".
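For instance, on simulated data with one gross outlier, the robust option keeps the fit from being dragged toward it (a sketch; the span and noise values are illustrative choices of ours):

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)
y[50] <- 10  # plant one gross outlier

fit_ls  <- loess(y ~ x, degree = 1, span = 0.3)                        # least squares
fit_rob <- loess(y ~ x, degree = 1, span = 0.3, family = "symmetric")  # robust

# the robust fit is pulled far less toward the outlier:
abs(predict(fit_ls,  x[50]) - sin(x[50]))
abs(predict(fit_rob, x[50]) - sin(x[50]))
```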

The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters:

mod <- loess(Y~X, degree=1, span = span, data=dat)
loess_fit <- augment(mod)

ggplot(dat, aes(X, Y)) +
    geom_point(cex=5) + geom_line(aes(x=X, y=.fitted), data=loess_fit, color="red")

Note that different spans give us different smooths:


loess

The final fits for the different spans look like this:

spans <- c(.66, 0.25, 0.15, 0.10)

fits <- data_frame(span = spans) %>% 
  group_by(span) %>% 
  do(augment(loess(Y~X, degree=1, span = .$span, data=dat)))

ggplot(dat, aes(X, Y)) +
  geom_point(shape=1,cex=3) +
  geom_line(aes(x=X, y = .fitted, frame = X, cumulative = TRUE), data = fits, color = "red") +
  facet_wrap(~span)

Note that ggplot2 uses loess in its geom_smooth function. But be careful with the default behavior. Here is what the defaults give us:

ggplot(dat, aes(X, Y)) +
  geom_point(shape=1) + geom_smooth(color="red")
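The loess parameters used by geom_smooth can be set explicitly rather than left at their defaults (a sketch: the span and degree values are illustrative, and we use stand-in data so the snippet is self-contained; with the polls data, substitute dat):

```r
library(ggplot2)

# stand-in data shaped roughly like the polls series
set.seed(1)
toy <- data.frame(X = -155:-1)
toy$Y <- sin(toy$X / 20) + rnorm(nrow(toy), sd = 0.2)

# pass the smoothing method and its parameters explicitly
p <- ggplot(toy, aes(X, Y)) +
  geom_point(shape = 1) +
  geom_smooth(method = "loess", span = 0.15,
              method.args = list(degree = 1),
              se = FALSE, color = "red")
p
```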

Multiple predictors

Loess is a powerful tool when we have one predictor. But what if we have more than one? Note that we defined the concept of windows. How do we define these windows when we have more than one covariate? What is a window when we have 784 predictors? To answer this, it is helpful to understand the concept of distance.

Distance

The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are “close” to each other.

Clustering of animals.

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors.

Euclidean Distance

As a review, let’s define the distance between two points, \(A\) and \(B\), on a Cartesian plane.

The Euclidean distance between \(A\) and \(B\) is simply:

\[\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\]
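In R, this is just a check of the formula:

```r
# Euclidean distance between A = (0, 0) and B = (3, 4) on the plane
A <- c(x = 0, y = 0)
B <- c(x = 3, y = 4)
sqrt(sum((A - B)^2))  # 5, the familiar 3-4-5 right triangle
```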

Distance in High Dimensions

Earlier we introduced a training dataset with a feature matrix containing measurements on 784 features for 500 digits.

sample_n(train_set,10) %>% select(label, pixel351:pixel360) 
## Source: local data frame [10 x 11]
## 
##    label pixel351 pixel352 pixel353 pixel354 pixel355 pixel356 pixel357
##    (chr)    (int)    (int)    (int)    (int)    (int)    (int)    (int)
## 1      2        0        0        0      128      255      255       64
## 2      7      112      112      112      237      252       84        0
## 3      2      129      254      254      199        0        0        0
## 4      2        0        0        0       63      247      253      211
## 5      7        0       89      249      215       25        0        2
## 6      7       57      253      253      243        0        0        0
## 7      7        5      173      254      132        0        0        0
## 8      7        0        0       31      113      195      232      254
## 9      2       13      119      229      254      173       33        0
## 10     2        0       85      252      252      252        0        0
## Variables not shown: pixel358 (int), pixel359 (int), pixel360 (int)

For the purposes of smoothing, we are interested in describing the distance between observations, in this case digits. Later, for the purposes of selecting features, we might also be interested in finding pixels that behave similarly across samples.

To define distance, we need to know what the points are, since mathematical distance is computed between points. With high-dimensional data, points are no longer on the Cartesian plane; instead, they are in higher dimensions. For example, observation \(i\) is defined by a point in 784-dimensional space: \((Y_{i,1},\dots,Y_{i,784})^\top\). Feature \(j\) is defined by a point in 500 dimensions: \((Y_{1,j},\dots,Y_{500,j})^\top\).

Once we define points, the Euclidean distance is defined in a very similar way as it is defined for two dimensions. For instance, the distance between two observations, say observations \(i=1\) and \(i=2\) is:

\[ \mbox{dist}(1,2) = \sqrt{ \sum_{j=1}^{784} (Y_{1,j}-Y_{2,j })^2 } \]

and the distance between two features, say, \(15\) and \(273\) is:

\[ \mbox{dist}(15,273) = \sqrt{ \sum_{i=1}^{500} (Y_{i,15}-Y_{i,273})^2 } \]

Example

The first thing we will do is create a matrix with the predictors:

X <- select(train_set, pixel0:pixel783) %>% as.matrix()

Rows and columns of matrices can be accessed like this:

third_row <- X[3,]
tenth_column <- X[,10]

So the first two observations are 2s and the 253rd is a 7. Let’s see if their distances match this:

X_1 <- X[1,]
X_2 <- X[2,]
X_253 <- X[253,]
sqrt(sum((X_1-X_2)^2))
## [1] 2940.847
sqrt(sum((X_1-X_253)^2))
## [1] 2856.173

As expected, the two 2s are closer to each other than either is to the 7. If you know matrix algebra, note that a faster way to compute this is using matrix algebra:

sqrt( crossprod(X_1-X_2) )
##          [,1]
## [1,] 2940.847
sqrt( crossprod(X_1-X_253) )
##          [,1]
## [1,] 2856.173

Now, to compute all the distances at once, we can use the function dist. Because it computes the distance between each pair of rows, and here each observation is a row, we can apply it directly to the matrix:

d <- dist(X)
class(d)
## [1] "dist"

Note that this produces an object of class dist and, to access the entries using row and column indices, we need to coerce it into a matrix:

as.matrix(d)[1,2]
## [1] 2940.847
as.matrix(d)[1,253]
## [1] 2856.173

We can quickly see an image of these distances:

image(as.matrix(d))

Note that, for illustrative purposes, we earlier defined two predictors, X_1 and X_2. Defining distances between observations based on these two covariates is much more intuitive since we can simply visualize the distance in a two-dimensional plot:

ggplot(train_set) + 
  geom_point(aes(X_1, X_2, fill=label), pch=21, cex=5)

Distance between predictors

Perhaps a more interesting result comes from computing distance between predictors:

image(as.matrix(dist(t(X))))

k Nearest Neighbors

K-nearest neighbors (kNN) is similar to bin smoothing, but it is easier to adapt to multiple dimensions. We first define the distance between all observations based on the features. Basically, for any point \(\bf{x}\) for which we want an estimate of \(f(\bf{x})\), we look for the \(k\) nearest points and take an average of their outcomes. This gives us an estimate of \(f(x_1,x_2)\), just like the bin smoother gave us an estimate of a curve. We can now control flexibility through \(k\).
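The idea can be sketched directly: for a point x0, find the k rows with the smallest distance to it and average their outcomes (a hand-rolled sketch with names of our own choosing; knn3, used below, does this, plus tie handling, for us):

```r
# Hand-rolled kNN estimate of f at x0: average y over the k nearest rows
knn_estimate <- function(X, y, x0, k = 5) {
  d <- sqrt(colSums((t(X) - x0)^2))  # distance from x0 to every row of X
  mean(y[order(d)[1:k]])             # average outcome of the k nearest
}

# toy check with two well-separated clusters
X <- rbind(matrix(0, 10, 2), matrix(5, 10, 2))
y <- rep(c(0, 1), each = 10)
knn_estimate(X, y, x0 = c(0.1, -0.1))  # near the first cluster: 0
knn_estimate(X, y, x0 = c(4.9, 5.2))   # near the second cluster: 1
```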

Let’s use our logistic regression as a straw man:

library(caret)
glm_fit <- glm(y~., data = select(train_set, y, X_1, X_2), family = "binomial")
f_hat <- predict(glm_fit, newdata = test_set, 
                 type = "response")
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$tab
##     truth
## pred   0   1
##    0 193  54
##    1  49 204
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##    0.794

Now, let’s compare to kNN, starting with the default \(k=5\):

knn_fit <- knn3(y~.,data = select(train_set, y, X_1, X_2) )
f_hat <- predict(knn_fit, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$tab
##     truth
## pred   0   1
##    0 202  49
##    1  40 209
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##    0.822

This already improves over the logistic model. Let’s see why:

When \(k=5\), we see some islands of red in the blue area. This is due to what we call over-training. Note that we have higher accuracy in the training set compared to the test set:

f_hat <- predict(knn_fit, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##    0.822
f_hat_train <- predict(knn_fit, newdata = train_set)[,2]
tab <- table(pred=round(f_hat_train), truth=train_set$y)
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##    0.884

Over Training

Over-training is at its worst when we set \(k=1\). In this case, we will obtain perfect accuracy in the training set because each point is used to predict itself, so perfect accuracy must happen by definition. However, the test-set accuracy is actually worse than logistic regression.

knn_fit_1 <- knn3(y~.,data = select(train_set, y, X_1, X_2), k=1)

f_hat <- predict(knn_fit_1, newdata = train_set)[,2]
tab <- table(pred=round(f_hat), truth=train_set$y)
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##        1
f_hat <- predict(knn_fit_1, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##     0.81

We can see the over-fitting problem in this figure:

We can also over-smooth. Look at what happens with the 251 nearest neighbors.

knn_fit_251 <- knn3(y~.,data = select(train_set, y, X_1, X_2), k=251)
f_hat <- predict(knn_fit_251, newdata = test_set)[,2]
tab <- table(pred=round(f_hat), truth=test_set$y)
confusionMatrix(tab)$overall["Accuracy"]
## Accuracy 
##      0.8

This turns out to be similar to logistic regression:

We can use cross-validation to pick \(k\):

control <- trainControl(method='cv', number=2, p=.5)
dat2 <- mutate(dat, label=as.factor(label)) %>%
  select(label,X_1,X_2)
res <- train(label ~ .,
             data = dat2,
             method = "knn",
             trControl = control,
             tuneGrid = data.frame(k = seq(3, 151, 2)), # grid of k values to try
             metric = "Accuracy")
plot(res)

With k=11 we obtain what appears to be a decent estimate of the true \(f\).
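What train does under the hood amounts to the following; here is a self-contained base-R sketch on simulated data (the data, function names, and grid are all ours, not caret’s):

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(2 * n), n, 2)
y <- as.integer(X[, 1] + X[, 2] + rnorm(n, sd = 0.5) > 0)

# hand-rolled kNN classifier: rounded average outcome of the k nearest rows
knn_predict <- function(train_X, train_y, test_X, k) {
  apply(test_X, 1, function(x0) {
    d <- sqrt(colSums((t(train_X) - x0)^2))
    round(mean(train_y[order(d)[1:k]]))
  })
}

# 2-fold cross-validation over a grid of odd k values
ks <- seq(3, 51, 2)
fold <- sample(rep(1:2, length.out = n))
cv_acc <- sapply(ks, function(k) {
  mean(sapply(1:2, function(f) {
    pred <- knn_predict(X[fold != f, ], y[fold != f],
                        X[fold == f, , drop = FALSE], k)
    mean(pred == y[fold == f])
  }))
})
best_k <- ks[which.max(cv_acc)]  # the k with the best held-out accuracy
```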

An important part of data science is visualizing results to determine why we are succeeding and why we are failing.

Here are some 2s that were correctly called with high probability:

Here are some 2s that were incorrectly called with high probability:

Here are some for which the prediction was about 50-50:

Here are some 7s that were correctly called with high probability:

Here are some 7s that were incorrectly called with high probability: